Abstract. The Information-Geometric Optimization (IGO) has been introduced 
as a unified framework for stochastic search algorithms. Given a parametrized 
family of probability distributions on the search space, the IGO turns an arbi- 
trary optimization problem on the search space into an optimization problem on 
the parameter space of the probability distribution family and defines a natural 
gradient ascent on this space. From the natural gradients defined over the entire 
parameter space we obtain continuous time trajectories which are the solutions of 
an ordinary differential equation (ODE). Via discretization, the IGO naturally de- 
fines an iterated gradient ascent algorithm. Depending on the chosen distribution 
family, the IGO recovers several known algorithms such as the pure rank-$\mu$ update
CMA-ES. Consequently, the continuous time IGO-trajectory can be viewed
as an idealization of the original algorithm. 

In this paper we study the continuous time trajectories of the IGO given the family 
of isotropic Gaussian distributions. These trajectories are a deterministic contin- 
uous time model of the underlying evolution strategy in the limit for population 
size to infinity and change rates to zero. On functions that are the composite of 
a monotone and a convex-quadratic function, we prove the global convergence 
of the solution of the ODE towards the global optimum. We extend this result 
to composites of monotone and twice continuously differentiable functions and 
prove local convergence towards local optima. 



1 Introduction 

Evolution Strategies (ESs) are stochastic search algorithms for numerical optimization. 
In ESs, candidate solutions are sampled using a Gaussian distribution parametrized by a 
mean vector and a covariance matrix. In state-of-the-art ESs, those parameters are
iteratively adapted using the ranking of the candidate solutions w.r.t. the objective function.
Consequently, ESs are invariant to applying a monotonic transformation to the objective 
function. Adaptive ES algorithms are successfully applied in practice and there is ample 
empirical evidence that they converge linearly towards a local optimum of the objective 
function on a wide class of functions. However, their theoretical analysis even on simple 
functions is difficult as the state of the algorithm is given by both the mean vector and 
the covariance matrix that have a stochastic dynamic that needs to be simultaneously 
controlled. Their linear convergence to local optima is so far only proven for functions
that are the composite of a monotonic transformation with a convex quadratic function
(hence functions with a single optimum) and for rather simple search algorithms compared
to the covariance matrix adaptation evolution strategy (CMA-ES), which is considered
the state-of-the-art ES [1-4]. In this paper, instead of analyzing the exact stochastic
dynamic of the algorithms, we consider the deterministic time continuous model
underlying adaptive ESs that follows from the Information-Geometric Optimization (IGO)
setting recently introduced in [5].

The Information-Geometric Optimization is a unified framework for randomized
search algorithms. Given a family of probability distributions parametrized by $\theta \in \Theta$,
the original objective function, $f$, is transformed to a fitness function $J_\theta$ defined on $\Theta$.
The IGO algorithm defined on $\Theta$ performs a natural gradient ascent aiming at
maximizing $J_\theta$. For the family of Gaussian distributions, the IGO algorithm recovers the
pure rank-$\mu$ update CMA-ES [6]; for the family of Bernoulli distributions, PBIL [7]
is recovered. When the step-size for the gradient ascent algorithm (that corresponds to
a learning rate in CMA-ES and PBIL) goes to zero, we obtain an ordinary differential
equation (ODE) in $\theta$. The set of solutions of this ODE, the IGO-flow, consists of
continuous time models of the recovered algorithms in the limit of the population size going
to infinity and the step-size (learning rate for ES or PBIL) to zero.

In this paper we analyze the convergence of the IGO-flow for isotropic ESs where
the family of distributions is Gaussian with covariance matrix equal to an overall
variance times the identity. The underlying algorithms are step-size adaptive ESs that
resemble ESs with derandomized adaptation [8] and encompass xNES [9] and the pure
rank-$\mu$ update CMA-ES with only one variance parameter [6]. Previous works have
proposed and analyzed continuous models of ESs that are solutions of ODEs [10, 11] using
the machinery of stochastic approximation [12, 16]. The ODE variable in these studies
encodes solely the mean vector of the search distribution, and the overall variance
is taken to be proportional to $H(\nabla f)$ where $H$ is a smooth function with $H(0) = 0$.
Consequently the model analyzed loses invariance to monotonic transformation of the
objective function and scale-invariance, both being fundamental properties of virtually
all ESs. The technique relies on the Lyapunov function approach and assumes the
stability of critical points of the ODE [10, 11]. In this paper, our approach also relies on
the stability of the critical points of the ODE that we analyze by means of Lyapunov
functions. However, one difficulty stems from the fact that when convergence occurs,
the variance typically converges to zero, which is at the boundary of the definition
domain $\Theta$. To circumvent this difficulty we extend the standard Lyapunov method to be
able to study stability of boundary points.

Applying the extended Lyapunov's method to the IGO-flow in the manifold of 
isotropic Gaussian distributions, we derive a sufficient condition on the so-called weight 
function w — parameter of the algorithm and usually chosen by the algorithm designer — 
so that the IGO-flow converges to the global minimum independently of the starting 
point on objective functions that are composite of a monotonic function with a convex 
quadratic function. We will call those functions monotonic convex-quadratic-composite 
in the sequel. We then extend this result to functions that are the composition of a
monotonic transformation and a twice continuously differentiable function, called monotonic
$C^2$-composite in the rest of the paper. We prove local convergence to a local optimum of
the function in the sense that starting close enough to a local optimum, with a small
enough variance, the IGO-flow converges to this local optimum.

The rest of the paper is organized as follows. In Section 2 we introduce the IGO-flow
for the family of isotropic Gaussian distributions, which we call ES-IGO-flow. In
Section 3 we extend the standard Lyapunov's method for proving stability. In Section 4 we
apply the extended method to the ES-IGO-flow and provide convergence results of the
ES-IGO-flow on monotonic convex-quadratic-composite functions and on monotonic
$C^2$-composite functions.

Notation. For $A \subset X$, where $X$ is a topological space, we let $A^c$ denote the
complement of $A$ in $X$, $A^\circ$ the interior of $A$, $\overline{A}$ the closure of $A$, and
$\partial A = \overline{A} \setminus A^\circ$ the boundary of $A$. Let $\mathbb{R}$ and $\mathbb{R}^d$ be the sets of real numbers
and $d$-dimensional real vectors; $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_+$ denote the sets of non-negative and
positive real numbers, respectively. Let $\|x\|$ represent the Euclidean norm of $x \in \mathbb{R}^d$.
The open and closed balls in $\mathbb{R}^d$ centered at $\theta$ with radius $r > 0$ are denoted by
$B(\theta, r)$ and $\overline{B}(\theta, r)$.

Let $\mu_{\mathrm{Leb}}$ denote the Lebesgue measure on either $\mathbb{R}$ or $\mathbb{R}^d$. Let $P_1$ and $P_d$ be the
probability measures induced by the one-variate and $d$-variate standard normal
distributions, and $p_1$ and $p_d$ the probability density functions induced by $P_1$ and $P_d$ w.r.t. $\mu_{\mathrm{Leb}}$.
Let $p_\theta$ and $P_\theta$ represent the probability density function w.r.t. $\mu_{\mathrm{Leb}}$ and the probability
measure induced by the Gaussian distribution $\mathcal{N}(m(\theta), C(\theta))$ parameterized by $\theta \in \Theta$,
where the mean vector $m(\theta)$ is in $\mathbb{R}^d$ and the covariance matrix $C(\theta)$ is a positive
definite symmetric matrix of dimension $d$. We sometimes abbreviate $m(\theta(t))$ and $C(\theta(t))$
to $m(t)$ and $C(t)$. Let $\mathrm{vec} : \mathbb{R}^{d \times d} \to \mathbb{R}^{d^2}$ denote the vectorization operator such that
$\mathrm{vec} : C \mapsto [C_{1,1}, C_{1,2}, \dots, C_{1,d}, C_{2,1}, \dots, C_{d,d}]^T$, where $C_{i,j}$ is the $i,j$-th element of
$C$. We use both notations: $\theta = [m^T, \mathrm{vec}(C)^T]^T$ and $\theta = (m, C)$.

2 The ES-IGO-flow 

The IGO framework for continuous optimization with the family of Gaussian
distributions is as follows. The original objective is to minimize an objective function
$f : \mathbb{R}^d \to \mathbb{R}$. This objective function is mapped into a function on $\Theta$. Hereunder, we
suppose that $f$ is $\mu_{\mathrm{Leb}}$-measurable. Let $w : [0, 1] \to \mathbb{R}$ be a bounded, non-increasing
weight function. We define the weighted quantile function [5] as

$$W_\theta^f(x) = w\big(P_\theta[y : f(y) \le f(x)]\big) . \tag{1}$$

The function $W_\theta^f(x)$ is a preference weight for $x$ according to the $P_\theta$-quantile. The
fitness value of $\theta'$ given $\theta$ is defined as the expectation of the preference $W_\theta^f$ over $P_{\theta'}$,
$J_\theta(\theta') = \mathbb{E}_{x \sim P_{\theta'}}[W_\theta^f(x)]$. Note that since $W_\theta^f(x)$ depends on $\theta$ so does $J_\theta(\theta')$. The
function $J_\theta$ is defined on a statistical manifold $(\Theta, \mathcal{I})$ equipped with the Fisher metric
$\mathcal{I}$ as a Riemannian metric. The Fisher metric is the natural metric. It is compatible with
relative entropy and with KL-divergence and is the only metric that does not depend
on the chosen parametrization. Using the log-likelihood trick and exchanging the order of
differentiation and integration, the "vanilla" gradient of $J_\theta$ at $\theta' = \theta$ can be expressed
as $\nabla_{\theta'} J_\theta(\theta')|_{\theta' = \theta} = \mathbb{E}_{x \sim P_\theta}[W_\theta^f(x) \nabla_\theta \ln(p_\theta(x))]$. The natural gradient, that is, the
gradient taken w.r.t. the Fisher metric, is given by the product of the inverse of the
Fisher information matrix $\mathcal{I}_\theta$ at $\theta$ and the vanilla gradient, namely $\mathcal{I}_\theta^{-1} \nabla_{\theta'} J_\theta(\theta')|_{\theta' = \theta}$.
The IGO ordinary differential equation is defined as

$$\frac{d\theta}{dt} = \mathcal{I}_\theta^{-1} \nabla_{\theta'} J_\theta(\theta')\big|_{\theta' = \theta} . \tag{2}$$

Since the right-hand side (RHS) of the above ODE is independent of $t$, the IGO ODE is
autonomous. The IGO-flow is the set of solution trajectories of the ODE (2).
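
The weighted quantile function depends on $f$ only through the ranking it induces, which is what makes the IGO-flow invariant to monotonic transformations of $f$. The following is a minimal sketch of that property, not part of the original analysis: the weight function `weight`, the test functions and the plug-in estimate of the quantile by (rank - 1/2)/n are all illustrative assumptions.

```python
import math
import random


def weight(p):
    """Hypothetical weight function: bounded and non-increasing on [0, 1]."""
    return (1.0 - p) ** 2


def empirical_preferences(f, samples):
    """Plug-in estimate of W(x_i) = w(P[f(y) <= f(x_i)]) from the samples themselves."""
    n = len(samples)
    values = [f(x) for x in samples]
    prefs = []
    for v in values:
        rank = sum(1 for u in values if u <= v)  # counts x_i itself, so rank >= 1
        prefs.append(weight((rank - 0.5) / n))
    return prefs


def sphere(x):
    return sum(c * c for c in x)


def transformed_sphere(x):
    # g(s) = exp(s) is strictly increasing, so the ranking of the samples is unchanged.
    return math.exp(sphere(x))


rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
prefs_f = empirical_preferences(sphere, samples)
prefs_gf = empirical_preferences(transformed_sphere, samples)
```

Both lists of preference weights coincide, so any update built from them is the same for $f$ and for $g \circ f$.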

When the parameter $\theta$ encodes the mean vector and the covariance matrix of the
Gaussian distribution in the following way, $\theta = [m^T, \mathrm{vec}(C)^T]^T$, the product of the
inverse of the Fisher information matrix $\mathcal{I}_\theta^{-1}$ and the gradient of the log-likelihood
$\nabla_\theta \ln(p_\theta(x))$ can be written in an explicit form [14] and (2) reduces to

$$\frac{d\theta}{dt} = \int W_\theta^f(x) \begin{bmatrix} x - m \\ \mathrm{vec}\big((x - m)(x - m)^T - C\big) \end{bmatrix} P_\theta(dx) . \tag{3}$$

The pure rank-$\mu$ update CMA-ES [6] can be considered as an Euler scheme for solving
(3) with a Monte-Carlo approximation of the integral. Let $x_1, \dots, x_n$ be samples
independently generated from $P_\theta$. Then, the quantile $P_\theta[y : f(y) \le f(x_i)]$ in (1) is
approximated by the number of solutions no worse than $x_i$ divided by $n$, i.e.,
$|\{x_j, j = 1, \dots, n : f(x_j) \le f(x_i)\}|/n =: R_i/n$. Then $W_\theta^f(x_i)$ is approximated by $w((R_i - 1/2)/n)$,
where $w$ is the given weight function. The Euler scheme for approximating the
solutions of (3) where the integral is approximated by Monte-Carlo leads to

$$\theta^{t+1} = \theta^t + \eta \sum_{i=1}^{n} \frac{w((R_i - 1/2)/n)}{n} \begin{bmatrix} x_i - m^t \\ \mathrm{vec}\big((x_i - m^t)(x_i - m^t)^T - C^t\big) \end{bmatrix} , \tag{4}$$

where $\eta$ is the time discretization step-size. This equation is equivalent to the pure
rank-$\mu$ update CMA-ES when the learning rates $\eta_m$ and $\eta_C$, for the update of $m^t$ and
$C^t$ respectively, are set to the same value $\eta$, while they have different values in practice
($\eta_m = 1$ and $\eta_C \ll 1$). The summation on the RHS in (4) converges to the RHS of (3)
with probability one as the population size $\lambda$ (denoted $n$ above) tends to infinity
(Theorem 4 in [5]).

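In the following, we study the simplified IGO-flow where the covariance matrix is
parameterized by only a single variance parameter $v$ as $C = v I_d$. Under the
parameterization $\theta = [m^T, v]^T$, (3) reduces to

$$\frac{d\theta}{dt} = \int W_\theta^f(x) \begin{bmatrix} x - m \\ \|x - m\|^2 / d - v \end{bmatrix} P_\theta(dx) .$$

Using the change of variable $z = (x - m)/\sqrt{v}$, the above ODE reads

$$\frac{d\theta}{dt} = F_\theta(\theta) , \qquad F_\theta(\theta) = \int W_\theta^f(m + \sqrt{v} z) \begin{bmatrix} \sqrt{v}\, z \\ v (\|z\|^2 / d - 1) \end{bmatrix} P_d(dz) \tag{5}$$

and we rewrite it by part

$$\frac{dm}{dt} = F_m(\theta) , \qquad F_m(\theta) = \sqrt{v} \int W_\theta^f(m + \sqrt{v} z)\, z\, P_d(dz) , \tag{6}$$

$$\frac{dv}{dt} = F_v(\theta) , \qquad F_v(\theta) = v \int W_\theta^f(m + \sqrt{v} z) (\|z\|^2 / d - 1)\, P_d(dz) . \tag{7}$$

The domain of this ODE is $\Theta = \{\theta = (m, v) \in \mathbb{R}^d \times \mathbb{R}_+\}$. We call (5) the ES-IGO
ordinary differential equation. The following proposition shows that for a Lipschitz
continuous weight function $w$, solutions of the ODE (5) exist for any initial condition
$\theta(0) \in \Theta$ and are unique.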
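
To make the link between the ES-IGO ODE and an actual algorithm concrete, the sketch below applies one Euler step of size `eta` to the isotropic parameterization $(m, v)$, with the integral replaced by an average over `n` samples ranked by $f$. It is an illustration under stated assumptions rather than the authors' implementation: the weight function `weight`, the test function `sphere`, and all numerical settings (`eta`, `n`, `steps`, `seed`) are hypothetical choices.

```python
import math
import random


def weight(p):
    # Hypothetical weight function: non-increasing, Lipschitz, convex and not linear.
    return (1.0 - p) ** 2


def sphere(x):
    return 0.5 * sum(c * c for c in x)


def es_igo_step(f, m, v, eta, n, rng):
    """One Euler step of size eta using an n-sample Monte Carlo estimate of F_theta."""
    d = len(m)
    sigma = math.sqrt(v)
    zs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    xs = [[m[k] + sigma * z[k] for k in range(d)] for z in zs]
    order = sorted(range(n), key=lambda i: f(xs[i]))  # best (smallest f) first
    step_m = [0.0] * d
    step_v = 0.0
    for position, i in enumerate(order):
        rank = position + 1                      # R_i, ignoring ties
        w_i = weight((rank - 0.5) / n) / n
        z = zs[i]
        for k in range(d):
            step_m[k] += w_i * sigma * z[k]      # x_i - m = sqrt(v) * z_i
        step_v += w_i * v * (sum(c * c for c in z) / d - 1.0)
    new_m = [m[k] + eta * step_m[k] for k in range(d)]
    new_v = v + eta * step_v                     # stays positive since 0 <= w <= 1, eta < 1
    return new_m, new_v


def run(f, m0, v0, eta=0.1, n=40, steps=2000, seed=1):
    rng = random.Random(seed)
    m, v = list(m0), v0
    for _ in range(steps):
        m, v = es_igo_step(f, m, v, eta, n, rng)
    return m, v


m_final, v_final = run(sphere, m0=[3.0, -2.0], v0=1.0)
```

On this convex quadratic the mean approaches the optimum at the origin while the variance shrinks towards zero without ever reaching it, which mirrors the boundary behaviour of the continuous-time model.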



Proposition 1 (Existence and Uniqueness). Suppose $w$ is Lipschitz continuous. Then
the initial value problem $\frac{d\theta}{dt} = F_\theta(\theta)$, $\theta(0) = \theta_0$, has a unique solution on $[0, \infty)$ for
each $\theta_0 \in \Theta$, i.e. there is only one solution $\theta : \mathbb{R}_{\geq 0} \to \Theta$ to the initial value problem.

Proof. We can obtain a lower bound $a(t) > 0$ and an upper bound $b(t) < \infty$ for $v(t)$
for each $t \geq 0$ under a bounded $w$. Similarly, we can have an upper bound $c(t) < \infty$
for $\|m(t)\|$. Then we have that $(m(t), v(t)) \in E(t) = \{x \in \mathbb{R}^d : \|x\| \le c(t)\} \times \{x \in \mathbb{R}_+ : a(t) \le x \le b(t)\}$
and $E(t)$ is compact for each $t \geq 0$. Meanwhile, $F_\theta$ is
locally Lipschitz continuous for a Lipschitz continuous $w$. Since $E(t)$ is compact, the
restriction of $F_\theta$ to $E(t)$ is Lipschitz continuous. Applying Theorem 3.2 in [15], which
is an extension of the theorem known as the Picard-Lindelöf theorem or Cauchy-Lipschitz
theorem, we have the existence and uniqueness of the solution on each bounded interval
$[0, t]$. Since $t$ is arbitrary, we have the proposition. □

Now that we know that solutions of the ES-IGO ODE exist and are unique, we
define the ES-IGO-flow as the mapping $\varphi : \mathbb{R}_{\geq 0} \times \Theta \to \Theta$, which maps $(t, \theta_0)$ to
the solution $\theta(t)$ of (5) with initial condition $\theta(0) = \theta_0$. Note that we can extend the
domain of $F_\theta$ from $\Theta = \mathbb{R}^d \times \mathbb{R}_+$ to $\overline{\Theta} = \mathbb{R}^d \times \mathbb{R}_{\geq 0}$. It is easy to see from (5) that the
value of $F_\theta(\theta)$ at $\theta = (m, 0)$ is $0$ for any $m \in \mathbb{R}^d$. However, we exclude the boundary
$\partial\Theta$ from the domain for reasons that will become clear in the next section. Because
the initial variance must be positive and the variance starting from the positive region never
reaches the boundary in finite time, solutions $\varphi(t, \cdot)$ will stay in the domain $\Theta$. However,
as we will see, they can converge asymptotically towards points of the boundary.

Since $J_\theta$ is adaptive, i.e. $J_{\theta_1}(\theta) \neq J_{\theta_2}(\theta)$ for $\theta_1 \neq \theta_2$ in general, it is not trivial
to determine whether the solutions to (5) converge to points where $F_\theta(\theta) = 0$.¹ Even
knowing that they converge to zeros of $F_\theta(\theta)$ is not helpful at all, because we have
$F_\theta(\theta) = 0$ for any $\theta$ with variance zero and we are actually interested in convergence
to the point $(x^*, 0)$ where $x^*$ is a local optimum of $f$.

Remark 1. Because of the invariance property of the natural gradient, the mean vector
$m(\theta)$ and the variance $v(\theta)$ obey (6) and (7) under re-parameterization of the Gaussian
distributions. Therefore, the trajectories of $m$ and $v$ are also independent of the
parameterization. For instance, we obtain the same trajectories $v(\theta)$ for any of the following
parameterizations: $\theta_{d+1} = v$, $\theta_{d+1} = \sqrt{v}$, and $\theta_{d+1} = \frac{1}{2} \ln v$, although the trajectories
of the parameters $\theta_{d+1}$ are of course different. Consequently, the same convergence
results for $m(\theta)$ and $v(\theta)$ (see Section 4) will hold under any parameterization.
Parameterizations $\theta = (m, v)$ and $\theta = (m, \frac{1}{2} \ln v)$ correspond to the pure rank-$\mu$ update
CMA-ES and the xNES with only one variance parameter. Thus, the continuous model
to be analyzed encompasses both algorithms.
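
As a worked check of Remark 1, the derivation below sketches why the log-parameterization gives the same variance trajectory. It is an illustration rather than part of the original text: the symbol $\xi = \frac{1}{2} \ln v$ is introduced here, and the computation assumes the standard fact that the Fisher matrix of a Gaussian is block diagonal between the mean and the variance parameter, so the variance coordinate can be treated on its own.

```latex
% Scalar reparameterization \xi = \tfrac12 \ln v, so v = e^{2\xi} and dv/d\xi = 2v.
\begin{align*}
\mathcal{I}_\xi &= \Big(\frac{dv}{d\xi}\Big)^{2} \mathcal{I}_v , \qquad
\partial_\xi \ln p_\theta(x) = \frac{dv}{d\xi}\, \partial_v \ln p_\theta(x) , \\
\frac{d\xi}{dt} &= \int W_\theta^f(x)\, \mathcal{I}_\xi^{-1}\, \partial_\xi \ln p_\theta(x)\, P_\theta(dx)
  = \Big(\frac{dv}{d\xi}\Big)^{-1} F_v(\theta)
  = \frac{1}{2} \int W_\theta^f(m + \sqrt{v} z) \Big(\frac{\|z\|^2}{d} - 1\Big) P_d(dz) , \\
\frac{dv}{dt} &= \frac{dv}{d\xi}\, \frac{d\xi}{dt} = F_v(\theta) .
\end{align*}
```

The trajectory of $\xi$ differs from that of $v$, but the induced trajectory of $v$ satisfies the same variance equation of the ES-IGO ODE.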

¹ If $J_\theta$ is not adaptive and defined to be the expectation of the objective function $f(x)$ over $P_\theta$,
convergence to the zeros of the RHS of (2) is easily obtained. For example, see Theorem 12 and
its proof in [13], where the solution to the system of a similar ODE whose RHS is the vanilla
gradient of the expected objective function is derived and the convergence of the solution
trajectory to the critical point of the expected function is proven.

Remark 2. Theory of stochastic approximation says that a stochastic algorithm $\theta^{t+1} = \theta^t + \eta h^t$
follows the solution trajectories of the ODE $\frac{d\theta}{dt} = \mathbb{E}[h^t \mid \theta^t = \theta]$ in the limit
for $\eta$ to zero under several conditions. In our setting, $\theta$ encodes $m$ and $v$ and the noisy
observation is $h^t = \sum_{i=1}^{\lambda} w_{R_i} \mathcal{I}_{\theta^t}^{-1} \nabla_\theta \ln p_{\theta^t}(x_i)$, where $w_i$, $i = 1, \dots, \lambda$, are predefined
weights and $R_i$ is the ranking of $x_i$. If we define $w(p) = \sum_{i=1}^{\lambda} w_i \binom{\lambda - 1}{i - 1} p^{i-1} (1 - p)^{\lambda - i}$
in (1), then $F_\theta(\theta) = \mathbb{E}[h^t \mid \theta^t = \theta]$ and the ODE agrees with (5). Therefore, (5) can
be viewed as the limit behavior of adaptive-ES algorithms not only in the case $\eta \to 0$
and $\lambda \to \infty$ but also in the case $\eta \to 0$ and finite $\lambda$. Indeed, it is possible to bound
the difference between $\{\theta^t, t \geq 0\}$ and the solution $\theta(\cdot)$ of the ODE (5) by extending
Lemma 1 in Chapter 9 of [16]. The details are omitted due to the space limitation.²
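
The weight function associated with a finite set of rank weights in Remark 2 can be read as a conditional expectation: if a point has quantile $p$, the number of the other $\lambda - 1$ samples that rank ahead of it is binomial with parameters $\lambda - 1$ and $p$. The sketch below evaluates the resulting polynomial $w(p) = \sum_i w_i \binom{\lambda - 1}{i - 1} p^{i-1} (1 - p)^{\lambda - i}$. The function name and the two example weight vectors are hypothetical, and any normalization of the rank weights is left to the caller.

```python
import math


def weight_function_from_rank_weights(rank_weights):
    """Return w(p) = sum_i w_i * C(lam-1, i-1) * p**(i-1) * (1-p)**(lam-i)."""
    lam = len(rank_weights)

    def w(p):
        total = 0.0
        for i, w_i in enumerate(rank_weights, start=1):
            # Probability that exactly i-1 of the other lam-1 samples are better.
            prob_rank_i = math.comb(lam - 1, i - 1) * p ** (i - 1) * (1.0 - p) ** (lam - i)
            total += w_i * prob_rank_i
        return total

    return w


lam = 6
# Equal rank weights give a constant weight function.
w_uniform = weight_function_from_rank_weights([1.0] * lam)
# Weighting only the best sample gives w(p) = (1 - p)**(lam - 1).
w_best_only = weight_function_from_rank_weights([1.0] + [0.0] * (lam - 1))
```

Because each binomial term is a polynomial in $p$, any choice of rank weights yields a Lipschitz continuous $w$ on $[0, 1]$.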

3 Extension of Lyapunov Stability Theorem 

When convergence occurs, the variance typically converges to zero. Hence the study 
of the convergence of the solutions of the ODE will be carried out by analyzing the 
stability of the points $\theta^* = (x^*, 0)$. However, because points with variance zero are
excluded from the domain $\Theta$, we need to extend classical definitions of stability to be
able to handle points located on the boundary of $\Theta$.

Definition 1 (Stability). Consider the following system of differential equations

$$\dot{\theta} = F(\theta) , \qquad \theta(0) = \theta_0 \in D , \tag{8}$$

where $F : D \to \mathbb{R}^{d_\theta}$ is a continuous map and $D \subset \mathbb{R}^{d_\theta}$ is open. Then $\theta^* \in \overline{D}$ is called

- stable in the sense of Lyapunov³ if for any $\varepsilon > 0$ there is $\delta > 0$ such that
$\theta_0 \in D \cap B(\theta^*, \delta) \Rightarrow \theta(t) \in D \cap B(\theta^*, \varepsilon)$ for all $t \geq 0$, where $t \mapsto \theta(t)$ is any
solution of (8);

- locally attractive if there is $\delta > 0$ such that $\theta_0 \in D \cap B(\theta^*, \delta) \Rightarrow \lim_{t \to \infty} \|\theta(t) - \theta^*\| = 0$
for any solution $t \mapsto \theta(t)$ of (8);

- globally attractive if $\lim_{t \to \infty} \|\theta(t) - \theta^*\| = 0$ for any $\theta_0 \in D$ and any solution
$t \mapsto \theta(t)$ of (8);

- locally asymptotically stable if it is stable and locally attractive;

- globally asymptotically stable if it is stable and globally attractive.

We can now understand why we need to exclude points with variance zero from
the domain $\Theta$. Indeed, points with variance zero are points from where solutions of the
ODE will never move because $F_\theta(\theta) = 0$. Consequently, if we include points $(x, 0)$
in $\Theta$, none of these points can be attractive, as in any neighborhood we always find
$\theta_0 = (x_0, 0)$ such that a solution starting in $\theta_0$ stays there and thus cannot converge to any
other point.

² When $H(\theta)$ is a (natural) gradient of a function, the stochastic algorithm is called a stochastic
gradient method. The theory of stochastic gradient methods (e.g., [17]) relates the convergence
of the stochastic algorithm with the zeros of $H(\theta)$. However, it is not applicable to our
algorithm due to the reason mentioned above Remark 1.

³ Usually, stability is defined for stationary points. However, it is not the only case that a point
is stable in our definition. Let $\theta^* \in \overline{D}$ be a stable point. If $\theta^* \in D$ or $F$ can be prolonged
by continuity at $\theta^*$ as $\lim_{\theta \to \theta^*} F(\theta) = F(\theta^*)$, then $F(\theta^*) = 0$. That is, $\theta^*$ is a stationary
point. However, $\lim_{\theta \to \theta^*} F(\theta)$ does not always exist for a stable boundary point $\theta^* \in \partial D$.
For example, consider the ODE $d\theta_1/dt = -\theta_1/\sqrt{\theta_1^2 + \theta_2^2}$, $d\theta_2/dt = -\theta_2$. The domain is
$\mathbb{R} \times \mathbb{R}_+$. Then, $|\theta_1|$ and $\theta_2$ are monotonically decreasing to zero. Hence, $(0, 0)$ is globally
asymptotically stable. However, $\lim_{\theta \to (0,0)} F(\theta)$ does not exist.
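
The two-dimensional example given in the footnote on stability, $d\theta_1/dt = -\theta_1/\sqrt{\theta_1^2 + \theta_2^2}$ and $d\theta_2/dt = -\theta_2$ on $\mathbb{R} \times \mathbb{R}_+$, can be explored numerically. The sketch below integrates it with explicit Euler and evaluates the first component of $F$ along two different approaches to the boundary point $(0, 0)$. The step size, time horizon and starting point are illustrative assumptions, chosen so that the Euler update factors stay in $(0, 1)$.

```python
import math


def vector_field(theta1, theta2):
    """Right-hand side of the example ODE on the domain R x R_+ (theta2 > 0)."""
    norm = math.sqrt(theta1 * theta1 + theta2 * theta2)
    return -theta1 / norm, -theta2


def integrate(theta1, theta2, dt=1e-3, t_end=5.0):
    """Explicit Euler; dt stays below theta2 over the horizon, so no overshoot occurs."""
    path = [(theta1, theta2)]
    for _ in range(int(round(t_end / dt))):
        f1, f2 = vector_field(theta1, theta2)
        theta1, theta2 = theta1 + dt * f1, theta2 + dt * f2
        path.append((theta1, theta2))
    return path


path = integrate(1.0, 1.0)
final_theta1, final_theta2 = path[-1]

# First component of F along two different approaches to the boundary point (0, 0).
f1_along_axis = vector_field(1e-9, 1e-12)[0]     # theta2 much smaller than theta1
f1_near_vertical = vector_field(1e-12, 1e-9)[0]  # theta1 much smaller than theta2
```

Both coordinates decrease monotonically towards zero along the computed path, while the two evaluations of the first component of $F$ near the origin disagree, consistent with a stable boundary point at which $F$ has no limit.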

A standard technique to prove stability is Lyapunov's method, which consists in finding
a scalar function $V : \mathbb{R}^{d_\theta} \to \mathbb{R}_{\geq 0}$ that is positive except for a candidate stable point $\theta^*$
with $V(\theta^*) = 0$, and that is monotonically decreasing along any trajectory of the ODE.
Such a function is called a Lyapunov function (and is analogous to a potential function in
dynamical systems). Lyapunov's method does not require the analysis of the solutions
of the ODE. The standard Lyapunov's stability theorem gives practical conditions to
verify that a function is indeed a Lyapunov function. However, because our candidate
stable points are located on $\partial\Theta$, we need to extend this standard theorem.

Lemma 1 (Extended Lyapunov Stability Method). Consider the autonomous system
(8), where $F : D \to \mathbb{R}^{d_\theta}$ is a map and $D \subset \mathbb{R}^{d_\theta}$ is the open domain of $\theta$. Let $\theta^* \in \overline{D}$
be a candidate stable point. Suppose that there is an $R > 0$ such that

(A1): $F(\theta)$ is continuous on $D \cap B(\theta^*, R)$;

(A2): there is a continuously differentiable $V : \mathbb{R}^{d_\theta} \to \mathbb{R}$ such that for some strictly
increasing continuous function $\alpha : \mathbb{R}_+ \to \mathbb{R}_+$ satisfying $\lim_{p \to \infty} \alpha(p) = \infty$,

$$V(\theta^*) = 0 , \qquad V(\theta) > \alpha(\|\theta - \theta^*\|) \quad \forall \theta \in D \cap B(\theta^*, R) \setminus \{\theta^*\} \tag{9}$$

and

$$\nabla V(\theta)^T F(\theta) < 0 \quad \forall \theta \in D \cap B(\theta^*, R) \setminus \{\theta^*\} ; \tag{10}$$

(A3): for any $r_1$ and $r_2$ such that $0 < r_1 \le r_2 < R$, if a solution $\theta(\cdot)$ to (8) starting
from $D_{r_1, r_2} = \{\theta \in D : r_1 \le \|\theta - \theta^*\| \le r_2\}$ stays in $D_{r_1, r_2}$ for $t \in [0, \infty)$, then
there are a $T \geq 0$ and a compact set $E \subset D_{r_1, r_2}$ such that $\theta(t) \in E$ for $t \in [T, \infty)$.

Then, $\theta^*$ is locally asymptotically stable. If (A1) and (A2) hold with $D$ replacing
$D \cap B(\theta^*, R)$ and (A3) holds with $R = \infty$, then $\theta^*$ is globally asymptotically stable.

Proof. We follow the proof of Theorem 4.1 in [15]. We have from assumptions (A1)
and (A2) that there is $\delta < R$ such that $\theta^*$ is stable and $V(\theta(t)) \to V^* \geq 0$ for each
$\theta_0 \in D \cap B(\theta^*, \delta)$. Moreover, under (A1) and (A2) with $D$ replacing $D \cap B(\theta^*, R)$
we have that $V(\theta(t)) \to V^* \geq 0$ for each $\theta_0 \in D$. Since $\lim_{t \to \infty} V(\theta(t)) = 0$ implies
$\lim_{t \to \infty} \|\theta(t) - \theta^*\| = 0$ by (9), it is enough to show $V^* = 0$. We show $V^* = 0$ by
a contradiction argument. Assume that $V^* > 0$. Then, we have that for each $\theta_0 \in D$ (or
$\theta_0 \in D \cap B(\theta^*, \delta)$ for the case of local asymptotic stability) there are $r_1$ and $r_2$ such that
$0 < r_1 \le r_2$ ($\le \delta$) and $\theta(t)$ lies in $D_{r_1, r_2}$ for $t \geq 0$. Note that $D_{r_1, r_2}$ is not necessarily
a compact set. This is different from Theorem 4.1 in [15]. By assumption (A3) we have
that there is a compact set $E$ and $T \geq 0$ such that $\theta(t) \in E$ for $t \geq T$. Since $V$ is
continuously differentiable and $F$ is continuous, $\nabla V(\theta)^T F(\theta)$ is continuous. Then, the
function $\theta \mapsto \nabla V(\theta)^T F(\theta)$ has its maximum $-\beta$ on the compact $E$ and $-\beta < 0$ by (10).
This leads to $V(\theta(t)) \le V(\theta(T)) - \beta (t - T) \to -\infty$ as $t \to \infty$. This contradicts the
hypothesis that $V^* > 0$. Hence, $V^* = 0$ for any $\theta_0 \in D$ (or $\theta_0 \in D \cap B(\theta^*, \delta)$). □

4 Convergence of the ES-IGO-flow 

In this section we study the convergence properties of the ES-IGO-flow $\varphi : (t, \theta_0) \mapsto \theta(t)$,
where $\theta(\cdot)$ represents the solution to the ES-IGO ODE (5) with initial value $\theta(0) = \theta_0$,
i.e., $\frac{d\varphi(t, \theta_0)}{dt} = F_\theta(\varphi(t, \theta_0))$ and $\varphi(0, \theta_0) = \theta_0$. By the definition of asymptotic
stability, the global asymptotic stability of $\theta^* \in \overline{\Theta}$ implies the global convergence, that
is, $\lim_{t \to \infty} \varphi(t, \theta_0) = \theta^*$ for all $\theta_0 \in \Theta$. Moreover, the local asymptotic stability of
$\theta^* \in \overline{\Theta}$ implies the local convergence, that is, $\exists \delta > 0$ such that $\lim_{t \to \infty} \varphi(t, \theta_0) = \theta^*$
for all $\theta_0 \in \Theta \cap B(\theta^*, \delta)$. We will prove convergence properties of the ES-IGO-flow
by applying Lemma 1. In order to prove our result we need to make the following
assumptions on $w$:

(B1): $w$ is non-increasing and Lipschitz continuous with $w(0) > w(1)$;
(B2): $\int w(P_1[y : y \le z]) (z^2/d - 1/d)\, P_1(dz) = \alpha > 0$.

Assumption (B1) is not restrictive. Indeed, the non-increasing and non-constant
property of $w(\cdot)$ is a natural requirement and any weight setting in (4) can be
expressed, for any given population size $n$, as a discretization of some Lipschitz
continuous weight function. Assumption (B2) is satisfied if and only if the variance $v$
diverges exponentially on a linear function. In fact, $F_v(\theta)$ defined in (7) reduces to
$v \int w(P_1[y : y \le z]) (z^2/d - 1/d)\, P_1(dz)$ when $f(x) = a^T x$ for any $a \in \mathbb{R}^d \setminus \{0\}$, and we
have that $\dot{v} = \alpha v$ and the solution is $v(t) = v_0 \exp(\alpha t)$. Then, $v(t) \to \infty$ as $t \to \infty$.
Assumption (B2) holds, for example, if $w$ is convex and not linear.
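
Assumption (B2) is a one-dimensional integral against the standard normal distribution, so it can be checked numerically for a candidate weight function. The sketch below estimates $\alpha$ by the trapezoidal rule. The two weight functions, the dimension `d=2`, and the grid settings are illustrative assumptions: a convex non-linear $w$ is expected to give $\alpha > 0$, whereas a linear $w$ sits on the boundary case $\alpha = 0$.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def alpha(weight, d, z_max=10.0, n_grid=20001):
    """Trapezoidal estimate of the integral of w(P_1[y <= z]) (z^2/d - 1/d) against P_1."""
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -z_max + k * h
        value = weight(std_normal_cdf(z)) * (z * z - 1.0) / d * std_normal_pdf(z)
        # End points get half weight in the trapezoidal rule.
        total += value * (0.5 if k in (0, n_grid - 1) else 1.0)
    return total * h


alpha_convex = alpha(lambda p: (1.0 - p) ** 2, d=2)  # convex and not linear
alpha_linear = alpha(lambda p: 1.0 - p, d=2)         # linear: boundary case
```

With the linear weight the variance neither grows nor shrinks on a linear function in the continuous-time model, which is why (B2) asks for a strictly positive $\alpha$.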

Let $\mathcal{G}$ be the set of strictly increasing functions $g : \mathbb{R} \to \mathbb{R}$ that are $\mu_{\mathrm{Leb}}$-measurable
and $C^2$ be the set of twice continuously differentiable functions $h : \mathbb{R}^d \to \mathbb{R}$ that are
$\mu_{\mathrm{Leb}}$-measurable. Under (B1) and (B2), we have the following main theorems.

Theorem 1. Suppose that the objective function $f$ is a monotonic convex-quadratic-composite
function $g \circ h$, where $g \in \mathcal{G}$ and $h$ is a convex quadratic function
$x \mapsto (x - x^*)^T A (x - x^*)/2$ where $A$ is positive definite and symmetric. Assume that (B1)
and (B2) hold. Then, $\theta^* = (x^*, 0) \in \overline{\Theta}$ is the globally asymptotically stable point of
the ES-IGO. Hence, we have the global convergence of $\varphi(t, \theta_0)$ to $\theta^*$.

Proof. Since the ES-IGO does not explicitly utilize the function values but uses the
quantile $P_\theta[y : f(y) \le f(x)]$, which is equivalent to $P_\theta[y : g^{-1} \circ f(y) \le g^{-1} \circ f(x)]$,
without loss of generality we assume $f = h$.

According to Lemma 1, it is enough to show that (A1) and (A2) hold with $D$ ($= \Theta$)
replacing $D \cap B(\theta^*, R)$ and (A3) holds with $R = \infty$. As is mentioned in the proof of
Proposition 1, $F_\theta$ is locally Lipschitz continuous for a Lipschitz continuous $w$. Thus,
(A1) is satisfied under (B1).

We can choose as a Lyapunov candidate function $V(\theta) = \sum_{i=1}^{d} (m_i - x_i^*)^2 + d \cdot v = \|m - x^*\|^2 + \mathrm{Tr}(v I_d)$.
All the conditions on $V$ described in (A2) are obvious except
for the negativeness of $\nabla V(\theta)^T F_\theta(\theta)$. To show the negativeness, rewrite $F_\theta(\theta)$ as
$\int W_\theta^f(m + \sqrt{v} z) F_\theta(\theta, z)\, P_d(dz)$. The idea is to show the (strictly) negative
correlation between $W_\theta^f(m + \sqrt{v} z)$ and $\nabla V(\theta)^T F_\theta(\theta, z)$ by using an extension of the result
in [18, Chapter 1] and apply the inequality
$\int W_\theta^f(m + \sqrt{v} z) \nabla V(\theta)^T F_\theta(\theta, z)\, P_d(dz) < \int W_\theta^f(m + \sqrt{v} z)\, P_d(dz) \int \nabla V(\theta)^T F_\theta(\theta, z)\, P_d(dz) = 0$.
We use the non-increasing
property of $w$ with $w(0) > w(1)$ in (B1) to show the negative correlation.

To prove (A3), we require (B2). Since a continuously differentiable function can
be approximated by a linear function at any non-critical point $\bar{x}$, the natural gradient
$F_\theta$ is approximated by that on a linear function in a small neighborhood of $(\bar{x}, 0)$. We
use the property $\mu_{\mathrm{Leb}}[x : f(x) = s] = 0$ to approximate $F_\theta$. As is mentioned above,
(B2) implies that $F_v$ on a linear function is positive. By using the approximation and this
property, we can show that $E = D_{r_1, r_2} \cap \{\theta : v \geq \bar{v}\}$ satisfies (A3) for some $\bar{v} > 0$. □

We have that for any initial condition $\theta(0) = (m_0, v_0)$, the search distribution $P_\theta$
weakly converges to the Dirac measure $\delta_{x^*}$ concentrated at the global minimum point
$x^*$. This result is generalized to monotonic $C^2$-composite functions using a quadratic
Taylor approximation. However, global convergence becomes local convergence.

Theorem 2. Suppose that the objective function $f$ is a monotonic $C^2$-composite
function $g \circ h$, where $g \in \mathcal{G}$ and $h \in C^2$ has the property that $\mu_{\mathrm{Leb}}[x : h(x) = s] = 0$
for any $s \in \mathbb{R}$. Assume that (B1) and (B2) hold. Let $x^*$ be a critical point of $h$, i.e.
$\nabla h(x^*) = 0$, with a positive definite Hessian matrix $A$. Then, $\theta^* = (x^*, 0) \in \overline{\Theta}$ is a
locally asymptotically stable point of the ES-IGO. Hence, we have the local convergence
of $\varphi(t, \theta_0)$ to $\theta^*$. Moreover, if $\bar{x}$ is not a critical point of $h(\cdot)$, for any $\theta_0 \in \Theta$, $\varphi(t, \theta_0)$
will never converge to $\bar{\theta} = (\bar{x}, 0)$.

Proof. As in the proof of Theorem 1, we assume $f = h$ without loss of generality. The
proofs of (A1) and (A3) carry over from Theorem 1 because we only used the property
$\mu_{\mathrm{Leb}}[x : f(x) = s] = 0$. To show (A2), we use the Taylor approximation of the
objective function $f$. Since $f$ is approximated by a quadratic function in a neighborhood
of a critical point $x^*$, we approximate the natural gradient by the corresponding
natural gradient on the quadratic function. Then, employing the same Lyapunov candidate
function as in the previous theorem we can show (A2). Because of the approximation,
we only have local asymptotic stability. The last statement of Theorem 2 is an
immediate consequence of the approximation of the natural gradient and (B2). □

We have that starting from a point close enough to a local minimum point $x^*$ with a
sufficiently small initial variance, the search distribution weakly converges to $\delta_{x^*}$. It is
not guaranteed for the parameter to converge somewhere when the initial mean is not
close enough to the local optimum or the initial variance is not small enough. Theorem 2
also states that the convergence $(m(t), v(t)) \to (\bar{x}, 0)$ does not happen for $\bar{x}$ such that
$\nabla h(\bar{x}) \neq 0$. That is, the continuous time ES-IGO does not prematurely converge on a
slope of the landscape of $f$.

5 Conclusion 

In this paper we have proven the local convergence of the continuous time model
associated to step-size adaptive ESs towards local minima on monotonic $C^2$-composite
functions. In the case of monotonic convex-quadratic-composite functions we have proven
the global convergence, i.e. convergence independently of the initial condition (pro- 
vided the initial step-size is strictly positive) towards the unique minimum. Our analysis 
relies on investigating the stability of critical points associated to the underlying ODE 
that follows from the Information Geometric Optimization setting. We use a classical 
method for the analysis of stability of critical points, based on Lyapunov functions. We 
have however extended the method to be able to handle convergence towards solutions 
at the boundary of the ODE definition domain. We believe that our approach is general
enough to handle more difficult cases like the CMA-ES with a more general covariance
matrix. We want to emphasize that the model we have analyzed is the correct model 
for step-size adaptive ESs as the ODE encodes both the mean vector and step-size and 
preserves fundamental invariance properties of the algorithm. 